Explore and Summarize Red Wine Quality Data within R

By Ehsan Jafari-Shirazi

Introduction

The goal of this project is to identify and quantify how chemical properties affect the quality grade of red wine. The dataset contains 1599 rows (red wine samples) and 12 variables: 11 physicochemical input variables and the quality score. The samples are red variants of the Portuguese “Vinho Verde” wine. A multiple regression analysis is conducted to determine whether, and how, the 11 independent variables can be used in a model to explain the variation in the quality of a red wine.

Univariate Plots Section

First, some preliminary explorations are performed:

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

In the boxplots above, the red star represents the mean and the middle blue line represents the median. Comparing the mean and the median, together with the histogram on the right, we can see that when the data are approximately normally distributed the mean and the median nearly coincide (e.g. pH or density), whereas when the data are skewed they lie apart (e.g. sulphates or total.sulfur.dioxide). The boxplots also help identify outliers, which appear as individual points beyond either end of the whiskers. Across the boxplots and histograms for each variable, four variables appear to have the largest outliers: fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide. I therefore removed the top 1% of values for each of these variables.
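The trimming step can be sketched as follows. This is a minimal illustration in base R, not the code used for the report: the data frame is synthetic, the helper `trim_top_percent` is a hypothetical name, and it assumes all cutoffs are computed on the untrimmed data before any row is dropped.

```r
# Synthetic stand-in for the wine data; the values are made up for illustration.
set.seed(42)
wine <- data.frame(
  fixed.acidity        = rlnorm(1000, meanlog = 2.1, sdlog = 0.2),
  residual.sugar       = rlnorm(1000, meanlog = 0.8, sdlog = 0.4),
  free.sulfur.dioxide  = rlnorm(1000, meanlog = 2.6, sdlog = 0.6),
  total.sulfur.dioxide = rlnorm(1000, meanlog = 3.6, sdlog = 0.7)
)

# Hypothetical helper: keep only rows that are at or below the chosen
# percentile in every listed column. Cutoffs come from the untrimmed data.
trim_top_percent <- function(data, cols, prob = 0.99) {
  keep <- rep(TRUE, nrow(data))
  for (col in cols) {
    cutoff <- quantile(data[[col]], probs = prob, na.rm = TRUE)
    keep <- keep & data[[col]] <= cutoff
  }
  data[keep, , drop = FALSE]
}

trimmed <- trim_top_percent(wine, names(wine))
nrow(wine) - nrow(trimmed)  # number of rows dropped
```

Because a row is dropped if it is extreme in any one of the four columns, the total number of removed rows can be up to four times 1% of the data.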

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   :0.900  
##  1st Qu.: 7.100   1st Qu.:0.3950   1st Qu.:0.0900   1st Qu.:1.900  
##  Median : 7.900   Median :0.5200   Median :0.2500   Median :2.200  
##  Mean   : 8.259   Mean   :0.5288   Mean   :0.2661   Mean   :2.409  
##  3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4200   3rd Qu.:2.600  
##  Max.   :13.200   Max.   :1.5800   Max.   :1.0000   Max.   :8.300  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 21.25      
##  Median :0.07900   Median :13.00       Median : 37.00      
##  Mean   :0.08699   Mean   :15.17       Mean   : 44.52      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 60.00      
##  Max.   :0.61100   Max.   :46.00       Max.   :144.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9967   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.316   Mean   :0.6569   Mean   :10.43  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7275   3rd Qu.:11.10  
##  Max.   :1.0029   Max.   :4.010   Max.   :2.0000   Max.   :14.00  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
## 'data.frame':    1534 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The long-tailed total sulfur dioxide and sulphates data should be transformed to obtain a more symmetric distribution. A log10 transformation can produce a roughly normal distribution for both. Let’s see how the log10 transformation works for each variable:

As can be observed from the above graph, the log10 transformation works well for the sulphates variable.

A similar result is obtained for total sulfur dioxide: comparing the following graphs, the log10 transformation again appears useful.

Fixed acidity and volatile acidity appear to be long tailed as well, so a log10 transformation should also be a good option. The following graphs support this:
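One way to check the effect of the transformation numerically, rather than by eye, is to compare the skewness before and after taking log10. The sketch below uses synthetic right-skewed values in place of the real sulphates column, and `skewness` is a small helper defined here (a moment-based estimate), not a function from the report.

```r
# Synthetic right-skewed values standing in for the sulphates column.
set.seed(1)
sulphates <- rlnorm(1000, meanlog = -0.45, sdlog = 0.25)

# Moment-based sample skewness; values near 0 indicate a symmetric distribution.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(sulphates)         # clearly positive for long right tails
skew_log <- skewness(log10(sulphates))  # expected to be much closer to 0

# The gap between mean and median should also shrink on the log10 scale.
gap_raw <- mean(sulphates) - median(sulphates)
gap_log <- mean(log10(sulphates)) - median(log10(sulphates))

round(c(skew_raw = skew_raw, skew_log = skew_log), 3)
```

Calling `hist()` on the raw and on the transformed vector gives the visual counterpart of the same comparison.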

Wine quality is recorded as an integer score, so it can be treated as an ordered categorical variable. To simplify comparisons, I created a new variable called Grade that groups quality into three distinct categories:

  • Bad (Quality < 5)

  • Average (Quality = 5 or 6)

  • Excellent (Quality > 6)

Here is the count of observations in each of these three groups:

##       bad   average excellent 
##        62      1264       208
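The grouping above can be expressed with `cut()`. This is a minimal sketch on a handful of made-up quality scores; the break points follow the three rules listed above, and the vector names are illustrative only.

```r
# A few hypothetical quality scores covering the observed range 3 to 8.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> bad, (4, 6] -> average, (6, Inf] -> excellent
grade <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("bad", "average", "excellent"),
  ordered_result = TRUE
)

table(grade)  # counts per grade
```

Using an ordered factor keeps the grades in bad, average, excellent order in tables and plots instead of alphabetical order.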

Univariate Analysis

What is the structure of your dataset?

After removing the top 1% of values for the four variables with large outliers (fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide), 1534 observations of 12 variables remain.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From the graphs I have seen so far, I believe residual sugar, pH, density and alcohol content have key roles in quality and may end up being selected for the final model.

Did you create any new variables from existing variables in the dataset?

Yes, I created a Grade variable that groups quality into three distinct categories: bad (3, 4), average (5, 6), and excellent (7, 8).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • The x column was removed as it was an index.
  • The top 1% of values were sliced off from the following variables: fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide.
  • Sulphates, total sulfur dioxide, fixed acidity, and volatile acidity were long tailed. I applied a log10 transformation, which brought their distributions closer to normal.

Bivariate Plots Section

To get a better look at the pairwise correlations between variables, ggpairs was used.

This scatterplot matrix is especially useful for variable selection in the final model. Recall that the main purpose of this analysis is to understand how chemical properties affect wine quality (the response, or dependent, variable). We want to select the independent variables that have the highest correlation with quality, so that the final model predicts quality better. Furthermore, we should avoid selecting independent variables that are highly correlated with each other, as this can cause multicollinearity, which makes the estimates of the model parameters (coefficients) unstable. For instance, including both free.sulfur.dioxide and total.sulfur.dioxide in the model is not advisable, as the two variables are highly correlated (> 0.60).
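The two screening rules described here (rank predictors by their correlation with quality, and flag predictor pairs that are strongly correlated with each other) can be sketched with a plain correlation matrix. The data below are synthetic and the column names are shortened stand-ins; the 0.6 threshold mirrors the figure quoted above and is otherwise arbitrary.

```r
# Synthetic data: total_so2 is built from free_so2 so the two are correlated,
# and quality depends on alcohol. None of these values come from the wine data.
set.seed(7)
n <- 500
free_so2  <- rlnorm(n, meanlog = 2.6, sdlog = 0.6)
total_so2 <- free_so2 * runif(n, min = 2, max = 4)
alcohol   <- rnorm(n, mean = 10.4, sd = 1)
quality   <- 5.6 + 0.4 * (alcohol - 10.4) + rnorm(n, sd = 0.7)
wine <- data.frame(free_so2, total_so2, alcohol, quality)

cors <- cor(wine)
predictors <- setdiff(colnames(cors), "quality")

# Rule 1: absolute correlation of each predictor with quality, strongest first.
with_quality <- sort(abs(cors[predictors, "quality"]), decreasing = TRUE)

# Rule 2: predictor pairs whose absolute correlation exceeds the threshold.
pred_cors <- cors[predictors, predictors]
high <- which(abs(pred_cors) > 0.6 & upper.tri(pred_cors), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(pred_cors)[high[, "row"]],
  var2 = colnames(pred_cors)[high[, "col"]],
  r    = pred_cors[high],
  stringsAsFactors = FALSE
)

with_quality
flagged
```

For a pair that ends up in `flagged`, a common choice is to keep only the member that is more strongly related to the response.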

In this section, I investigate the correlations between some of the independent variables. Based on the scatterplot matrix shown above, there are some interesting relationships between the following variables: citric acid and pH (~ -0.53), and citric acid and volatile acidity (~ -0.56). However, none of these variables seems to be strongly correlated with alcohol. Meanwhile, alcohol and quality have a correlation coefficient of 0.48. Hence, alcohol is a good candidate for the final model.

First, I plotted pH against fixed acidity. The correlation coefficient is -0.68, meaning that pH tends to decrease as fixed acidity increases, which makes sense chemically.

## [1] -0.6794406

The correlation between citric acid and pH is weaker, at -0.53. This makes sense, as citric acid is only one component of fixed acidity.

## [1] -0.5283267

Volatile acidity has a weak positive correlation with pH (0.24).

## [1] 0.2387919

## [1] -0.5629224

As can be seen in the graph, there is clearly a negative correlation between volatile acidity and citric acid. Chemically speaking, as volatile acidity is essentially acetic acid, a large amount of both ingredients would be unlikely in the same wine.

## [1] 0.2166557

There is only a weak relationship between alcohol and pH (a correlation of about 0.22).

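To further explore alcohol, pH, volatile acidity, citric acid, and sulphates and see how they relate to the quality of the wine, boxplots are used. The groups are compared by their medians, since the median is a more robust measure of central tendency than the mean when the data are skewed or contain outliers. The summaries below are listed in this order: pH, alcohol, volatile acidity, citric acid, and sulphates.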

## df$Grade: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.303   3.380   3.385   3.500   3.900 
## -------------------------------------------------------- 
## df$Grade: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.210   3.310   3.315   3.402   4.010 
## -------------------------------------------------------- 
## df$Grade: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.280   3.295   3.380   3.780

## df$Grade: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.20   10.97   13.10 
## -------------------------------------------------------- 
## df$Grade: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.50   10.00   10.26   10.90   14.00 
## -------------------------------------------------------- 
## df$Grade: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.50   10.80   11.60   11.54   12.22   14.00

## df$Grade: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5800  0.6800  0.7306  0.8838  1.5800 
## -------------------------------------------------------- 
## df$Grade: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300 
## -------------------------------------------------------- 
## df$Grade: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3100  0.3700  0.4090  0.4925  0.9150

## df$Grade: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0750  0.1713  0.2675  1.0000 
## -------------------------------------------------------- 
## df$Grade: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2400  0.2538  0.4000  0.7600 
## -------------------------------------------------------- 
## df$Grade: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.3950  0.3687  0.4900  0.7600

## df$Grade: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4925  0.5600  0.5927  0.6000  2.0000 
## -------------------------------------------------------- 
## df$Grade: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5400  0.6100  0.6457  0.7000  1.9800 
## -------------------------------------------------------- 
## df$Grade: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7444  0.8200  1.3600

The boxplots reveal an interesting fact about alcohol: alcohol content is noticeably higher for excellent wines than for bad or average wines. Sulphates and citric acid also seem to be positively correlated with quality, whereas volatile acidity appears to be negatively correlated with quality.
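The per-grade summaries above have the layout that `by()` prints when it is combined with `summary()`. The sketch below reproduces that pattern on a tiny made-up data frame (the alcohol values are invented), and also pulls out just the medians with `tapply()`.

```r
# Tiny illustrative data frame; the alcohol values are invented.
wine <- data.frame(
  alcohol = c(9.5, 10.0, 10.4, 9.8, 10.1, 10.6, 11.2, 11.8, 12.4),
  Grade   = factor(rep(c("bad", "average", "excellent"), each = 3),
                   levels = c("bad", "average", "excellent"))
)

# Min, quartiles, median, mean, and max of alcohol within each grade.
by(wine$alcohol, wine$Grade, summary)

# Only the medians, as a named vector in grade order.
grade_medians <- tapply(wine$alcohol, wine$Grade, median)
grade_medians
```

Comparing `grade_medians` across the three levels is the numeric equivalent of comparing the middle lines of the boxplots.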

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

It appears that citric acid and sulphates are positively related. Volatile acidity and citric acid are negatively correlated, as are citric acid and pH. Other interesting observations are as follows:

  • The median for sulphates increased with each step up in grade: 0.56 for bad, 0.61 for average, and 0.74 for excellent.
  • Citric acid had the highest concentration for excellent wines, with medians of 0.075 for bad, 0.24 for average, and 0.395 for excellent.
  • As volatile acidity increases, wine quality decreases, with medians of 0.68 for bad, 0.54 for average, and 0.37 for excellent.
  • The median alcohol content (10%) was the same for bad and average wines. For excellent wines, however, it was 11.6%. This suggests that a higher alcohol content may help lift a wine from average to excellent, although other factors could also affect the wine quality.
  • pH did not change much between the wine categories, with medians of 3.38 for bad, 3.31 for average, and 3.28 for excellent wines.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile acidity and citric acid, as well as citric acid and pH were negatively correlated. Fixed acidity and pH were also negatively correlated.

What was the strongest relationship you found?

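Among the correlations computed above, the strongest relationship was between fixed acidity and pH, with a correlation coefficient of -0.679. Citric acid and volatile acidity followed with a correlation coefficient of -0.563.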

Multivariate Plots Section

Comparing sulphates with alcohol, quality typically increased with sulphates for average wines. For excellent wines, alcohol appeared to play a more important role in determining quality at a given sulphate level.

In this section, I have created scatterplots for a few variables of interest, faceted by quality grade (bad, average, and excellent), to look for relationships and additional insights. It is worth noting that I have used a sequential color scale for quality as well as a regression line for each category, both of which help to show the separation between groups.

Sulphates vs Alcohol faceted by quality Grade and sequential quality color

As can be seen in the above chart, a log10 sulphate value between -0.1 and 0 combined with an alcohol level of around 12 is associated with the best quality score of 8, which falls in the excellent grade.

We know that citric acid affects quality as well. At a given level of citric acid, higher alcohol content typically meant better wines, with the exception of bad wines. It is likely that another factor in the bad wines masks the benefit of the added alcohol.

Citric acid vs Alcohol faceted by quality Grade and sequential quality color

I’m interested in learning which variable(s) are responsible for bad wine. From the observations so far, I decided to pick chlorides, residual sugar, and volatile acidity to find out whether they are associated with bad wines. Since low citric acid values occur in bad, average, and excellent wines alike, citric acid is used as the common axis against which each of these variables is compared.

Chlorides vs Citric Acid faceted by quality Grade and sequential quality color

It can be seen that, for a given level of chlorides, there are many average wines and some excellent wines with the same citric acid values as the bad wines. Additionally, most wines have similar levels of chlorides. Hence, chlorides can be ruled out.

Residual Sugar vs Citric Acid faceted by quality Grade and sequential quality color

As can be seen below, residual sugar content is not the variable causing bad wines either.

Volatile Acidity vs Citric Acid faceted by quality Grade and sequential quality color

The above graph, however, illustrates that most bad wines have higher levels of volatile acidity, while most excellent wines have lower levels of volatile acidity.

Volatile Acidity vs alcohol faceted by quality Grade and sequential quality color

For the upper right cluster under bad wines, it can be seen that the higher alcohol content of the wines is masked by the high volatile acidity (0.8 or higher).

Volatile Acidity vs sulphates faceted by quality Grade and sequential quality color

Comparing volatile acidity with sulphates, it can be concluded that excellent wines have a lower volatile acidity and a higher sulphates content, whereas bad wines have a lower sulphates content and higher volatile acidity content.

Linear Model

Based on the graphs and analysis performed, I used four major variables (alcohol, sulphates, citric acid, and volatile acidity) to build a linear model with quality as the response variable. The variables were added one at a time, giving the four nested models m1 to m4. The table below displays the results:

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = df)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = df)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + citric.acid, 
##     data = df)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + citric.acid + 
##     volatile.acidity, data = df)
## 
## ============================================================================
##                          m1            m2            m3            m4       
## ----------------------------------------------------------------------------
##   (Intercept)           1.734***      1.243***      1.296***      2.492***  
##                        (0.179)       (0.181)       (0.180)       (0.207)    
##   alcohol               0.374***      0.358***      0.351***      0.322***  
##                        (0.017)       (0.017)       (0.017)       (0.016)    
##   sulphates                           1.000***      0.824***      0.711***  
##                                      (0.104)       (0.108)       (0.105)    
##   citric.acid                                       0.511***     -0.087     
##                                                    (0.096)       (0.108)    
##   volatile.acidity                                               -1.242***  
##                                                                  (0.117)    
## ----------------------------------------------------------------------------
##   R-squared             0.239         0.282         0.295         0.344     
##   adj. R-squared        0.238         0.281         0.294         0.342     
##   sigma                 0.707         0.687         0.681         0.657     
##   F                   479.886       300.740       213.633       200.172     
##   p                     0.000         0.000         0.000         0.000     
##   Log-likelihood    -1644.831     -1599.682     -1585.485     -1530.839     
##   Deviance            766.824       722.987       709.728       660.922     
##   AIC                3295.663      3207.363      3180.970      3073.678     
##   BIC                3311.670      3228.706      3207.648      3105.692     
##   N                  1534          1534          1534          1534         
## ============================================================================
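The nested-model comparison in the table can be sketched in base R as below. This is an illustration under stated assumptions: the data are simulated (the coefficients used to generate them are arbitrary), citric acid is left out to keep the example short, and the comparison table is assembled by hand rather than with whichever table-formatting package produced the output above.

```r
# Simulated data with a known linear relationship; not the wine dataset.
set.seed(123)
n <- 400
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.0),
  sulphates        = rnorm(n, mean = 0.65, sd = 0.15),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
wine$quality <- 2.5 + 0.32 * wine$alcohol + 0.7 * wine$sulphates -
  1.2 * wine$volatile.acidity + rnorm(n, sd = 0.65)

# Build the models by adding one predictor at a time.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + sulphates)
m3 <- update(m2, . ~ . + volatile.acidity)

# Collect fit statistics side by side.
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit_stats <- data.frame(
  model  = names(models),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC)
)
fit_stats

# F-tests for whether each added predictor improves on the previous model.
anova(m1, m2, m3)
```

A rising adjusted R-squared together with a falling AIC from one model to the next is the pattern to look for, which is also what the table above shows from m1 to m4.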

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

As described in the above analysis, four features (alcohol, sulphates, citric acid, and volatile acidity) are the most important ones for explaining the feature of interest, quality. The relationships between these features can be summarized as follows:

  • Citric acid and alcohol: There is a clear relationship between alcohol content and citric acid with respect to the quality of wine. For instance, lower quality wines tended to be lower in both alcohol content and citric acid. Higher alcohol content went with better average wines regardless of the citric acid content. Additionally, excellent wines tended to be higher in both alcohol content and citric acid.

  • Sulphates and citric acid: For average wines, the plot of sulphates versus citric acid showed that sulphates were generally higher. For excellent wines, however, a higher citric acid content went with an excellent wine at a given level of sulphates. One may conclude that citric acid is more important than sulphates with regard to what makes a wine excellent. However, a log10 sulphate value between -0.25 and 0 appeared to be necessary for a wine to reach the excellent grade. This strengthens the idea that low sulphate content played a key role in average or bad wines.

  • Alcohol and volatile acidity: This relationship was an interesting one, as low volatile acidity appeared to be a requirement for a wine to be excellent. There are many average wines with volatile acidity between 0.4 and 0.8 and alcohol content between 9% and 10%, whereas most excellent wines have volatile acidity between 0.1 and 0.4. Bad and average wines were generally above 0.4 in volatile acidity regardless of the alcohol content.

  • Volatile acidity and sulphates: High volatile acidity together with low sulphates was a strong indicator of bad wine. Higher alcohol content, lower volatile acidity, higher citric acid, and higher sulphates together were associated with a good wine.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

As explained above, a linear model was built from the four variables that appeared most important in describing the feature of interest, wine quality: alcohol, citric acid, sulphates, and volatile acidity. Its main strength is that it is simple and easy to interpret. Its limitations are visible in the results table: the full model (m4) has an R-squared of 0.344, so it explains only about a third of the variation in quality, and the citric acid coefficient is no longer significant once volatile acidity is added. Quality is also an ordered score rather than a continuous quantity, which a linear model ignores. More advanced analysis would be needed to obtain a better model.


Final Plots and Summary

Plot One: Alcohol and Quality

Description One

This graph illustrates that a higher alcohol content is generally needed for excellent wines. The jump from an average wine to an excellent wine typically requires an alcohol level of about 12% or more. Other factors should not be ignored, however, as Plots Two and Three show.

Plot Two: Volatile Acidity vs Quality

Description Two

This graph supports the closing remark in the description of Plot One: a higher level of alcohol is necessary for a good quality wine, but it is not sufficient. As can be seen in the above chart, at volatile acidity levels above 0.8, an increase in alcohol does not lift a wine out of the bad grade. Furthermore, a volatile.acidity level between 0.4 and 0.8 typically results in an average wine. For volatile.acidity levels below 0.4, the jump in quality from average to excellent with increasing alcohol (similar to the one seen in Plot One) can be observed again.

Plot Three: Alcohol & Sulphates vs. Quality

Description Three

From this graph, it can be seen that lower sulphates content is typically associated with bad wines when alcohol varies between 10% and 12%. Average wines have higher sulphates in general. Nevertheless, alcohol content still plays a role: it also needs to be higher for higher sulphates to result in an average wine. Lastly, excellent wines are mostly clustered around higher alcohol contents (11-12%) and higher sulphate contents (log10 sulphates between -0.1 and 0).


Reflection

When I learned about this dataset, it immediately caught my interest, as I like wines in general and it was fascinating to learn in such detail about their ingredients and which of them most affect the quality of a wine. Overall, I believe this was a successful data analysis experience, since I was able to thoroughly explore different features and compare them with respect to the feature of interest (the wine quality variable), build a simple model from the most important features, and obtain a fairly clear understanding of the factors that make a quality wine. As for struggles, although there were not too many variables, it was not easy to find the most important and insightful selection of variables to plot or analyze, and I needed several attempts to obtain an insightful analysis. I believe that is the main part of the fun though! The next step for me would be to try more advanced models on the data, such as KNN or decision tree models.

References

  1. Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, José Reis, Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems, Volume 47, Issue 4, 2009, Pages 547-553, ISSN 0167-9236, https://doi.org/10.1016/j.dss.2009.05.016 (http://www.sciencedirect.com/science/article/pii/S0167923609001377)

  2. Dataset link: http://www3.dsi.uminho.pt/pcortez/dss09.bib

  3. http://r4stats.com/examples/graphics-ggplot2/

  4. http://datadrivenjournalism.net/resources/when_should_i_use_logarithmic_scales_in_my_charts_and_graphs

  5. https://www.r-bloggers.com/multiple-regression-lines-in-ggpairs/